Method and apparatus for using a previous column pointer to read entries in an array of a processor

ABSTRACT

A method and apparatus are described for using a previous column pointer to read a subset of entries of an array in a processor. The array may have a plurality of rows and columns of entries, and each entry in the subset may reside on a different row of the array. A previous column pointer may be generated for each of the rows of the array based on a plurality of bits indicating the number of valid entries in the subset to be read, the previous column pointer indicating whether each entry is in a current column or a previous column. The entries in the subset may be read and re-ordered, and invalid entries in the subset may be replaced with nulls. The valid entries and nulls may then be outputted.

FIELD OF INVENTION

This application is related to the design of a processor.

BACKGROUND

Dedicated pipeline queues have been used in multi-pipeline execution units of processors in order to achieve faster processing speeds. In particular, dedicated queues have been used for execution (EX) units having multiple EX pipelines that are configured to execute different subsets of a set of supported micro-instructions. Dedicated queuing has generated various bottlenecking problems and problems for the scheduling of microinstructions that required both numeric manipulation and retrieval/storage of data.

Processors are conventionally designed to process operations (Ops) that are typically identified by operation codes (OpCodes), (i.e., instruction codes). In the design of new processors, it is important to be able to process all of a standard set of Ops so that existing computer programs based on the standardized codes will operate without the need for translating Ops into an entirely new code base. Processor designs may further incorporate the ability to process new Ops, but backwards compatibility to older instruction sets is often desirable.

Execution of micro-instructions/Ops is typically performed in an execution unit of a processor. To increase speed, multi-core processors have been developed. Furthermore, to facilitate faster execution throughput, “pipeline” execution of Ops within an execution unit of a processor core is used. Cores having multiple execution units for multi-thread processing are also being developed. However, there is a continuing demand for faster throughput for processors.

One type of standardized set of Ops is the instruction set compatible with “x86” chips, (e.g., 8086, 286, 386, and the like), that have enjoyed widespread use in many personal computers. The micro-instruction sets, such as the “x86” instruction set, include Ops requiring numeric manipulation, Ops requiring retrieval and/or storage of data, and Ops that require both numeric manipulation and retrieval/storage of data. To execute such Ops, execution units within processors have included two types of pipelines: arithmetic logic pipelines (“EX pipelines”) to execute numeric manipulations, and address generation (AG) pipelines (“AG pipelines”) to facilitate load and store Ops.

In order to quickly and efficiently process Ops as required by a particular computer program, the program commands are decoded into Ops within the supported set of microinstructions and dispatched to the execution unit for processing. Conventionally, an OpCode is dispatched that specifies the Op/micro-instruction to be performed along with associated information that may include items such as an address of data to be used for the Op and operand designations.

Dispatched instructions/Ops are conventionally queued for a multi-pipeline scheduler queue of an execution unit. Queuing is conventionally performed with some type of decoding of a micro-instruction's OpCode in order for the scheduler queue to appropriately direct the instructions for execution by the pipelines with which it is associated within the execution unit.

The processing speed of the execution unit may be affected by the operation of any of its components. For example, any delay in scheduling of the instructions may adversely affect the overall speed of the execution unit.

SUMMARY OF EMBODIMENTS

A method and apparatus are described for using a previous column pointer to read a subset of entries of an array in a processor. The array may have a plurality of rows and columns of entries, and each entry in the subset may reside on a different row of the array. A previous column pointer may be generated for each of the rows of the array based on a plurality of bits indicating the number of valid entries in the subset to be read, the previous column pointer indicating whether each entry is in a current column or a previous column.

Each of the entries may include a physical register number (PRN). A row pointer may be used to indicate a first entry of the subset on a specific row of the array. The bits having a select logic value may be shifted together, and then the shifted bits may be rotated based on the row pointer. A new row pointer may be generated based on the rotated bits. The entries in the subset may be read and re-ordered, and invalid entries in the subset may be replaced with nulls. The valid entries and nulls may then be outputted.

A processor may include a decode unit configured to generate a plurality of bits, and an array having a plurality of rows and columns of entries, each entry in the subset residing on a different row of the array. A previous column pointer may be generated for each of the rows of the array based on the bits to indicate whether each entry is in a current column or a previous column.

A computer-readable storage medium may be configured to store a set of instructions used for manufacturing a semiconductor device. The semiconductor device may comprise the decode unit and the array described above. The instructions may be Verilog data instructions or hardware description language (HDL) instructions.

A computer-readable storage medium may be configured to store data for using a previous column pointer to read a subset of entries of an array having a plurality of rows and columns of entries where each entry in the subset resides on a different row of the array, by generating a previous column pointer for each of the rows of the array based on a plurality of bits indicating the number of valid entries in the subset to be read, and the previous column pointer indicating whether each entry is in a current column or a previous column.

BRIEF DESCRIPTION OF THE DRAWINGS

A more detailed understanding may be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:

FIG. 1 shows an example block diagram of a processor (e.g., a central processing unit (CPU)) including an execution (EX) unit that is configured to read a physical register number (PRN) array;

FIG. 2 shows an example configuration of a PRN array including a plurality of 8-bit entries;

FIG. 3 shows an example of reading the PRN array of the processor of FIG. 1 based on a destination valid signal received for each cycle of the processor;

FIG. 4 shows an example of the configuration of a PRN array having 80 entries;

FIG. 5 shows an example of reading 4 PRNs in a current column and 4 PRNs in a previous column of the PRN array;

FIG. 6 shows an example circuit for reading a first PRN from a first row of the PRN array in a current (third) column;

FIGS. 7-9 show example circuits for reading PRNs from the second, third and fourth rows of the PRN array in the current (third) column;

FIGS. 10-13 show example circuits for reading PRNs from the fifth, sixth, seventh and eighth rows of the PRN array in a previous (second) column when a previous column pointer is activated;

FIG. 14 is a block diagram of an optional PRN array processing circuit for generating an output of the PRN array;

FIG. 15 is a flow diagram of a procedure for using a previous column pointer to read a subset of PRNs from an array; and

FIG. 16 is a block diagram of an example device in which one or more disclosed embodiments may be implemented.

DETAILED DESCRIPTION OF EMBODIMENTS

FIG. 1 shows an example block diagram of a processor 100, (e.g., a central processing unit (CPU)), including an execution (EX) unit 105 and a decode unit 110. The EX unit 105 may include an arithmetic logic unit (ALU) 115 and a scheduler 120. The ALU 115 may include a physical register file (PRF) 125. The scheduler 120 may include a mapper 130 having a physical register number (PRN) array 135, (otherwise known as a freelist macro).

The EX unit 105 is responsible for all integer execution, (including AG), as well as coordination of all instruction retirement and exception handling. The EX unit 105 may be configured to translate all architectural source registers to their current physical registers, and to assign new physical registers to architectural destination registers through a rename and mapping process, using the mapper 130 and the PRN array 135. Physical tags, (i.e., indexes into the PRF 125), may be produced that are used for all subsequent dependency tracking. The PRF entries may be allocated and deallocated out-of-order. Therefore, the free entries need to be tracked in the PRN array 135 (i.e., a freelist structure). The PRN array 135 may store free PRF entries and provide free PRNs to rename destination registers. The PRN array 135 may be a first-in first-out (FIFO) queue that is read in-order with a read pointer. Up to eight (8) PRNs may be read out each cycle of the processor 100. There is a write pointer at the other end of the queue where newly freed PRNs are written. Up to eight (8) newly freed PRNs may be written each cycle.
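By way of a non-limiting illustration, the FIFO behavior described above may be modeled in software as a circular buffer with a read pointer and a write pointer. The following Python sketch is only a behavioral model under stated assumptions: the class name FreeList, the method names allocate and release, and the error handling are hypothetical and do not appear in the figures.

```python
CAPACITY = 80        # 8 rows x 10 columns of free PRNs
MAX_PER_CYCLE = 8    # up to eight PRNs read or written per cycle


class FreeList:
    """Behavioral model of a free-PRN FIFO (hypothetical helper, not the circuit)."""

    def __init__(self, free_prns):
        free_prns = list(free_prns)
        if len(free_prns) > CAPACITY:
            raise ValueError("too many initial PRNs")
        self.slots = free_prns + [None] * (CAPACITY - len(free_prns))
        self.read_ptr = 0
        self.write_ptr = len(free_prns) % CAPACITY
        self.count = len(free_prns)

    def allocate(self, n):
        """Read n free PRNs in order, starting at the read pointer."""
        if not 0 <= n <= MAX_PER_CYCLE or n > self.count:
            raise ValueError("cannot allocate that many PRNs this cycle")
        out = []
        for _ in range(n):
            out.append(self.slots[self.read_ptr])
            self.read_ptr = (self.read_ptr + 1) % CAPACITY
        self.count -= n
        return out

    def release(self, prns):
        """Write newly freed PRNs at the write pointer."""
        prns = list(prns)
        if len(prns) > MAX_PER_CYCLE or self.count + len(prns) > CAPACITY:
            raise ValueError("cannot release that many PRNs this cycle")
        for prn in prns:
            self.slots[self.write_ptr] = prn
            self.write_ptr = (self.write_ptr + 1) % CAPACITY
        self.count += len(prns)
```

In this model, PRNs that are released out-of-order are simply appended at the write pointer, so the read side always proceeds in-order regardless of the order in which PRF entries were freed.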

Eight (8) PRNs may be read every cycle of the processor 100, but they may not all come from the same column. The PRN array 135, (i.e., a freelist macro), may be an array which stores 8 (rows)×10 (columns)=80 free PRNs. Each PRN may include 8 bits.

The EX unit 105 may receive a destination valid signal 140 (8 bits) from the decode unit 110, in a pipe stage, which indicates the number of destinations that were valid in four (4) dispatch packets (2 bits each). A dispatch valid signal (not shown) may be received one cycle later in a mapping pipe stage which indicates whether 1, 2, 3 or 4 dispatch packets were valid. Both these valid signals may be used to determine where (which row) to start reading the PRNs in the cycle. Thus, instead of just using row and column pointers, an additional pointer is needed, (i.e., a previous column pointer), which determines whether it is necessary to read from a current column or a previous column for a particular row.

The scheduler 120 determines the order in which operations (Ops)/instructions are executed by the EX unit 105. The mapper 130 maps architectural registers, (designated by architectural register numbers (ARNs)), to physical registers, (designated by PRNs). The PRN array 135 is used to determine which of the PRNs in the PRF 125 are “free”, (i.e., valid and available for use). At the beginning of each cycle of the processor 100, the decode unit 110 sends a destination valid signal 140 to the ALU 115 and the scheduler 120 that indicates which of a subset of the PRNs stored in entries of the PRF 125 are valid and invalid. As an example, the destination valid signal 140 may have 8 bits, whereby each bit having a logic 1 value indicates a valid PRN, and each bit having a logic 0 value indicates an invalid PRN. The number of valid PRNs indicated by each destination valid signal 140 is provided to the PRN array 135 in the mapper 130 to determine the number of entries in the subset that are valid to be used.

As an example shown in FIG. 2, the PRN array 135 may include 80 entries 205 ₀, 205 ₁, 205 ₂, 205 ₃, . . . , 205 ₇₇, 205 ₇₈ and 205 ₇₉, each including a respective eight (8)-bit PRN P₀, P₁, P₂, P₃, . . . , P₇₇, P₇₈ and P₇₉, which may be considered for reading at a rate of eight (8) entries per cycle of the processor 100.

FIG. 3 shows an example of reading the PRN array 135 in the processor 100 of FIG. 1 based on the destination valid signal 140 received for each cycle of the processor 100. As shown in FIG. 3, in cycle N of the processor 100, the PRNs P₀-P₇ in the PRN array 135 are considered for reading. However, in this example, the destination valid signal 140 having a value “10001011” indicates that only four (4) PRNs are valid to be used, and thus only PRNs P₀-P₃ are used in cycle N, and PRNs P₄-P₇ are not used because they are considered to be invalid. In cycle N+1 of the processor 100, the PRNs P₄-P₁₁ in the PRN array 135 are considered for reading. However, in this example, the destination valid signal 140 having a value “10111011” indicates that six (6) PRNs are valid to be used, and thus PRNs P₄-P₉ are used in cycle N+1, and PRNs P₁₀ and P₁₁ are not used because they are considered to be invalid. In cycle N+2 of the processor 100, the PRNs P₁₀-P₁₇ in the PRN array 135 are considered for reading. However, in this example, the destination valid signal 140 having a value “10000000” indicates that only one (1) PRN is valid to be used, and thus PRN P₁₀ is used in cycle N+2, and PRNs P₁₁-P₁₇ are not used because they are considered to be invalid. This process of reading PRNs may continue until all of the PRNs have been read.
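The cycle-by-cycle behavior of FIG. 3 may be illustrated with a short Python sketch. It is a simplified model only: the function name simulate_reads is hypothetical, PRNs are represented by their integer positions in the array, and wrap-around at the end of the 80-entry array is omitted for brevity.

```python
def simulate_reads(valid_signals, start=0, width=8):
    """Return, per cycle, (positions considered, positions used).

    Each destination valid signal is a string of '0'/'1' characters; only
    the number of logic 1 bits decides how far the read position advances.
    """
    read_pos = start
    trace = []
    for signal in valid_signals:
        considered = list(range(read_pos, read_pos + width))
        n_valid = signal.count("1")
        used = considered[:n_valid]
        trace.append((considered, used))
        read_pos += n_valid  # unused PRNs are considered again next cycle
    return trace


trace = simulate_reads(["10001011", "10111011", "10000000"])
for cycle, (considered, used) in enumerate(trace):
    print(f"cycle N+{cycle}: considered P{considered[0]}-P{considered[-1]}, "
          f"used {['P%d' % p for p in used]}")
```

The sketch makes explicit that the starting position of each cycle depends on the number of valid destinations in the preceding cycle, which is why consecutive eight-entry reads generally do not align with a single column.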

One relatively simple way to implement this process would be to use a PRN array with eight (8) read ports and 80 entries. However, this may require a relatively large silicon area on the chip of the processor 100. Furthermore, undesired timing issues and a reduction in the speed of the processor 100 may result.

FIG. 4 shows an example of the configuration of a PRN array 135 having 80 entries, with 8 rows and 10 columns. Each entry stores a PRN (P₀-P₇₉).
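As a minimal sketch, the position of a PRN within the array may be computed as shown below. The sketch assumes that the entries fill the array column by column, an assumption that is consistent with the FIG. 5 example in which P₁₀ resides on the third row of the second column and P₁₆ resides on the first row of the third column. The function name locate is hypothetical.

```python
ROWS, COLS = 8, 10


def locate(index):
    """Return the 1-based (row, column) of position `index` (0 to 79).

    Assumes column-by-column filling: P0-P7 occupy column 1, P8-P15
    occupy column 2, and so on.
    """
    if not 0 <= index < ROWS * COLS:
        raise ValueError("index out of range")
    return index % ROWS + 1, index // ROWS + 1


print(locate(10))  # position of P10
print(locate(16))  # position of P16
```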

FIG. 5 shows an example of reading eight (8) PRNs at a time. Assuming that, previously, an attempt was made to read PRNs P₁₀-P₁₇, but only two PRNs were determined to be valid (e.g., the destination valid signal was “10010000”), only P₁₀ and P₁₁ would be read in that cycle. Since P₁₀ is the first PRN to be read and it resides on the third row of the PRN array 135, a row pointer is set to row 3 using a one-hot 8-bit indicator “00100000”. In the next cycle, an attempt to read PRNs P₁₂-P₁₉ is made, whereby four (4) of the PRNs are in a previous column 505 of the PRN array 135 and the other four (4) PRNs are in a current column 510 of the PRN array 135. A current column pointer (CCP) is set to column 3, (e.g., using a one-hot 10-bit indicator “0010000000”), a previous column pointer (PCP) is set to a logic 0 for rows 1-4 because the PRNs P₁₆-P₁₉ are in the current column 510, the PCP is set to a logic 1 for rows 5-8 because the PRNs P₁₂-P₁₅ are in the previous column 505, and the row pointer is set to row 5, (e.g., using a one-hot 8-bit indicator “00001000”).
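The pointer values of the FIG. 5 example may be reproduced with the following Python sketch. It is an illustrative model only, under these assumptions: entries fill the array column by column, the CCP identifies the column holding the last entry of the eight-entry window, and wrap-around at the end of the array is omitted. The names one_hot and window_pointers are hypothetical.

```python
ROWS, COLS = 8, 10


def one_hot(position, width):
    """One-hot string with the leftmost character standing for row/column 1."""
    return "".join("1" if k == position else "0" for k in range(width))


def window_pointers(start):
    """Return (row pointer, CCP, per-row PCP list) for entries start..start+7."""
    window = range(start, start + ROWS)
    current_col = window[-1] // ROWS          # 0-based column of the last entry
    pcp = [0] * ROWS
    for index in window:
        # Logic 1 when this row's entry lies in the column before the current one.
        pcp[index % ROWS] = 1 if index // ROWS < current_col else 0
    return one_hot(start % ROWS, ROWS), one_hot(current_col, COLS), pcp


row_pointer, ccp, pcp = window_pointers(12)   # attempt to read P12-P19
print(row_pointer, ccp, pcp)
```

For the window P₁₂-P₁₉ the sketch yields a row pointer of “00001000”, a CCP of “0010000000”, and a PCP of logic 0 for rows 1-4 and logic 1 for rows 5-8, matching the values given for FIG. 5.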

FIG. 6 shows an example circuit 600 for reading a first PRN (P₁₆) from a first row of the PRN array 135 of FIG. 5 in the current (third) column 510. The circuit 600 may include a plurality of multiplexers (MUXes) 605 ₁-605 ₁₀, each being controlled by the PCP. Since the first PRN (P₁₆) is in the current column 510, the PCP is set to a logic 0 and the CCP is set to column 3, the first PRN (P₁₆) is read via the logic 0 input of MUX 605 ₃ and a wordline (WL) 610.

FIG. 7 shows an example circuit 700 for reading a second PRN (P₁₇) from a second row of the PRN array 135 of FIG. 5 in the current (third) column 510. The circuit 700 may include a plurality of MUXes 705 ₁-705 ₁₀, each being controlled by the PCP. Since the second PRN (P₁₇) is in the current column 510, the PCP is set to a logic 0 and the CCP is set to column 3, the second PRN (P₁₇) is read via the logic 0 input of MUX 705 ₃ and a WL 710.

FIG. 8 shows an example circuit 800 for reading a third PRN (P₁₈) from a third row of the PRN array 135 of FIG. 5 in the current (third) column 510. The circuit 800 may include a plurality of MUXes 805 ₁-805 ₁₀, each being controlled by the PCP. Since the third PRN (P₁₈) is in the current column 510, the PCP is set to a logic 0 and the CCP is set to column 3, the third PRN (P₁₈) is read via the logic 0 input of MUX 805 ₃ and a WL 810.

FIG. 9 shows an example circuit 900 for reading a fourth PRN (P₁₉) from a fourth row of the PRN array 135 of FIG. 5 in the current (third) column 510. The circuit 900 may include a plurality of MUXes 905 ₁-905 ₁₀, each being controlled by the PCP. Since the fourth PRN (P₁₉) is in the current column 510, the PCP is set to a logic 0 and the CCP is set to column 3, the fourth PRN (P₁₉) is read via the logic 0 input of MUX 905 ₃ and a WL 910.

FIG. 10 shows an example circuit 1000 for reading a fifth PRN (P₁₂) from a fifth row of the PRN array 135 of FIG. 5 in the previous (second) column 505. The circuit 1000 may include a plurality of MUXes 1005 ₁-1005 ₁₀, each being controlled by the PCP. Since the fifth PRN (P₁₂) is in the previous column 505, the PCP is set to a logic 1 and the CCP remains set to column 3, the fifth PRN (P₁₂) is read via the logic 1 input of MUX 1005 ₂ and a WL 1010.

FIG. 11 shows an example circuit 1100 for reading a sixth PRN (P₁₃) from a sixth row of the PRN array 135 of FIG. 5 in the previous (second) column 505. The circuit 1100 may include a plurality of MUXes 1105 ₁-1105 ₁₀, each being controlled by the PCP. Since the sixth PRN (P₁₃) is in the previous column 505, the PCP is set to a logic 1 and the CCP remains set to column 3, the sixth PRN (P₁₃) is read via the logic 1 input of MUX 1105 ₂ and a WL 1110.

FIG. 12 shows an example circuit 1200 for reading a seventh PRN (P₁₄) from a seventh row of the PRN array 135 of FIG. 5 in the previous (second) column 505. The circuit 1200 may include a plurality of MUXes 1205 ₁-1205 ₁₀, each being controlled by the PCP. Since the seventh PRN (P₁₄) is in the previous column 505, the PCP is set to a logic 1 and the CCP remains set to column 3, the seventh PRN (P₁₄) is read via the logic 1 input of MUX 1205 ₂ and a WL 1210.

FIG. 13 shows an example circuit 1300 for reading an eighth PRN (P₁₅) from an eighth row of the PRN array 135 of FIG. 5 in the previous (second) column 505. The circuit 1300 may include a plurality of MUXes 1305 ₁-1305 ₁₀, each being controlled by the PCP. Since the eighth PRN (P₁₅) is in the previous column 505, the PCP is set to a logic 1 and the CCP remains set to column 3, the eighth PRN (P₁₅) is read via the logic 1 input of MUX 1305 ₂ and a WL 1310.
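The per-row selection performed by the circuits of FIGS. 6-13 may be modeled behaviorally as shown in the following Python sketch. The wiring is an assumption inferred from the description: for each row, the MUX of column k is taken to pass the CCP bit of column k on its logic 0 input and the CCP bit of column k+1 on its logic 1 input, so that a logic 1 PCP selects the column preceding the current column. The wrap from column 1 back to column 10 is likewise assumed, and the names wordlines and read_rows are hypothetical.

```python
ROWS, COLS = 8, 10


def wordlines(ccp, pcp_bit):
    """Model one row's ten 2:1 MUXes and return that row's one-hot wordlines."""
    return [ccp[(k + 1) % COLS] if pcp_bit else ccp[k] for k in range(COLS)]


def read_rows(array, ccp, pcp):
    """Read one entry per row, in row order, from the column each wordline selects."""
    out = []
    for row in range(ROWS):
        col = wordlines(ccp, pcp[row]).index(1)
        out.append(array[row][col])
    return out


# array[row][col] holds the label of the PRN stored there (column-by-column fill).
array = [[f"P{col * ROWS + row}" for col in range(COLS)] for row in range(ROWS)]
ccp = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # current column is column 3
pcp = [0, 0, 0, 0, 1, 1, 1, 1]         # rows 5-8 read from the previous column
result = read_rows(array, ccp, pcp)
print(result)
```

With the FIG. 5 pointer values, rows 1-4 return P₁₆-P₁₉ from the third column and rows 5-8 return P₁₂-P₁₅ from the second column, so the entries emerge in row order rather than in sequential order.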

FIG. 14 is a block diagram of an optional PRN array processing circuit 1400 for generating an output of the PRN array 135. The PRN array processing circuit 1400 may include a sorting logic unit 1405 and a validation logic unit 1410 used to generate a PRN array output 1415. The sorting logic unit 1405 receives the PRNs as they are read by the circuits 600-1300 of FIGS. 6-13 and re-orders the entries in the subset of PRNs such that they are in sequential order. The validation logic unit 1410 receives a destination valid signal and generates a PRN array output 1415 including valid PRNs and nulls.
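A minimal Python sketch of the sorting and validation steps follows. It rests on assumptions that the description does not spell out: re-ordering is modeled as a rotation by the starting row, the first entries of the sequential order are treated as the valid ones (as in the FIG. 3 example, where the leading PRNs are used), and a null is represented by None. The function name sort_and_validate is hypothetical.

```python
def sort_and_validate(row_order, start_row, dest_valid):
    """Re-order entries read in row order and null out the invalid ones.

    row_order  : entries as read from rows 1..8
    start_row  : 0-based row holding the first entry of the subset
    dest_valid : string of '0'/'1' characters; only the count of 1s is used
    """
    # Sorting: rotate so the entry on the starting row comes first.
    ordered = row_order[start_row:] + row_order[:start_row]
    # Validation: keep as many leading entries as there are valid destinations.
    n_valid = dest_valid.count("1")
    return [entry if k < n_valid else None for k, entry in enumerate(ordered)]


read = ["P16", "P17", "P18", "P19", "P12", "P13", "P14", "P15"]
print(sort_and_validate(read, 4, "10010000"))
```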

FIG. 15 is a flow diagram of a procedure 1500 for using a previous column pointer to read a subset of PRNs from an array. A plurality of bits are received indicating how many entries in a subset of entries to be read from a PRN array are valid, starting with a first entry on a specific row of the PRN array indicated by a row pointer, the PRN array having a plurality of rows and columns, each entry in the subset residing on a different row of the PRN array (1505). The bits having a logic 1 value are shifted together (1510), and then the bits are rotated based on the row pointer (1515). A previous column pointer is generated for each of the rows of the PRN array based on the rotated bits to indicate whether each entry is in a current column or a previous column (1520). A determination is made, based on the rotated bits, whether a current column pointer needs to be moved such that it points to the current column (1525). A new row pointer is generated based on the rotated bits (1530). The steps 1520, 1525 and 1530 may be performed concurrently. The entries in the subset are then read (1535). Optionally, the entries in the subset may be re-ordered, and the invalid entries in the subset may be replaced with nulls (1540). The valid entries (and nulls) are then output from the PRN array (1545). The procedure 1500 may be continuously repeated starting with step 1505.
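Steps 1510 through 1530 may be illustrated with the following Python sketch, which is one possible interpretation rather than a definitive implementation. The assumptions are: bit position 0 corresponds to row 1; the rotation moves the packed logic 1 bits so that they begin at the row indicated by the row pointer; the PCP for the next read is a logic 1 for every row at or after the new starting row (unless the new starting row is row 1); and the CCP advances whenever the entry on row 1 is consumed. All function names are hypothetical.

```python
ROWS = 8


def shift_together(bits):
    """Step 1510: pack the logic 1 bits together at the front."""
    n = sum(bits)
    return [1] * n + [0] * (len(bits) - n)


def rotate(bits, start_row):
    """Step 1515: rotate so the packed bits begin at start_row (0-based)."""
    k = start_row % len(bits)
    return bits[-k:] + bits[:-k] if k else list(bits)


def next_pointers(dest_valid, start_row):
    """Steps 1520-1530: derive the next PCP, CCP move and row pointer."""
    rotated = rotate(shift_together(dest_valid), start_row)
    new_start = (start_row + sum(rotated)) % ROWS
    row_pointer = [1 if r == new_start else 0 for r in range(ROWS)]
    pcp = [1 if (new_start != 0 and r >= new_start) else 0 for r in range(ROWS)]
    advance_ccp = rotated[0] == 1   # assumed rule: row 1 was consumed
    return rotated, row_pointer, pcp, advance_ccp


# FIG. 5 example: row pointer at row 3, destination valid signal "10010000".
print(next_pointers([1, 0, 0, 1, 0, 0, 0, 0], 2))
```

Applied to the FIG. 5 example, the rotated bits mark rows 3 and 4 as consumed, the new row pointer indicates row 5, the PCP is a logic 1 for rows 5-8, and the CCP does not move, which agrees with the pointer values described for FIG. 5.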

FIG. 16 is a block diagram of an example device 1600 in which one or more disclosed embodiments may be implemented. The device 1600 may include, for example, a computer, a gaming device, a handheld device, a set-top box, a television, a mobile phone, or a tablet computer. The device 1600 includes a processor 1602, a memory 1604, a storage 1606, one or more input devices 1608, and one or more output devices 1610. The device 1600 may also optionally include an input driver 1612 and an output driver 1614. It is understood that the device 1600 may include additional components not shown in FIG. 16. The processor 1602 may be configured in a similar fashion to the processor 100 shown in FIG. 1.

The processor 1602 may include a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core may be a CPU or a GPU. The memory 1604 may be located on the same die as the processor 1602, or may be located separately from the processor 1602. The memory 1604 may include a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.

The storage 1606 may include a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive. The input devices 1608 may include a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 1610 may include a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).

The input driver 1612 communicates with the processor 1602 and the input devices 1608, and permits the processor 1602 to receive input from the input devices 1608. The output driver 1614 communicates with the processor 1602 and the output devices 1610, and permits the processor 1602 to send output to the output devices 1610. It is noted that the input driver 1612 and the output driver 1614 are optional components, and that the device 1600 will operate in the same manner if the input driver 1612 and the output driver 1614 are not present.

Although features and elements are described above in particular combinations, each feature or element can be used alone without the other features and elements or in various combinations with or without other features and elements. The apparatus described herein may be manufactured by using a computer program, software, or firmware incorporated in a computer-readable storage medium for execution by a general purpose computer or a processor. Examples of computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).

Embodiments of the present invention may be represented as instructions and data stored in a computer-readable storage medium. For example, aspects of the present invention may be implemented using Verilog, which is a hardware description language (HDL). When processed, Verilog data instructions may generate other intermediary data, (e.g., netlists, GDS data, or the like), that may be used to perform a manufacturing process implemented in a semiconductor fabrication facility. The manufacturing process may be adapted to manufacture semiconductor devices (e.g., processors) that embody various aspects of the present invention.

Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, a graphics processing unit (GPU), an accelerated processing unit (APU), a DSP core, a controller, a microcontroller, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), any other type of integrated circuit (IC), and/or a state machine, or combinations thereof. 

What is claimed is:
 1. A method of using a previous column pointer to read a subset of entries of an array in a processor, the array having a plurality of rows and columns of entries, each entry in the subset residing on a different row of the array, the method comprising: generating a previous column pointer for each of the rows of the array based on a plurality of bits indicating the number of valid entries in the subset to be read, the previous column pointer indicating whether each entry is in a current column or a previous column.
 2. The method of claim 1 wherein each of the entries includes a physical register number (PRN).
 3. The method of claim 1 further comprising: using a row pointer to indicate a first entry of the subset on a specific row of the array.
 4. The method of claim 3 further comprising: shifting the bits having a select logic value together; and rotating the shifted bits based on the row pointer.
 5. The method of claim 4 further comprising: generating a new row pointer based on the rotated bits.
 6. The method of claim 5 further comprising: reading the entries in the subset.
 7. The method of claim 6 further comprising: re-ordering the entries in the subset.
 8. The method of claim 7 further comprising: replacing invalid entries in the subset with nulls.
 9. The method of claim 8 further comprising: outputting the valid entries and nulls.
 10. A processor comprising: a decode unit configured to generate a plurality of bits; and an array having a plurality of rows and columns of entries, each entry in the subset residing on a different row of the array, wherein a previous column pointer is generated for each of the rows of the array based on the bits to indicate whether each entry is in a current column or a previous column.
 11. The processor of claim 10 wherein each of the entries includes a physical register number (PRN).
 12. The processor of claim 10 wherein the bits indicate the number of valid entries in the subset to be read.
 13. The processor of claim 10 wherein a row pointer is used to indicate a first entry of the subset on a specific row of the array.
 14. The processor of claim 13 wherein the bits having a select logic value are shifted together, and the shifted bits are rotated based on the row pointer.
 15. The processor of claim 14 wherein a new row pointer is generated based on the rotated bits.
 16. The processor of claim 14 wherein the previous column pointer is used to read the entries in the subset.
 17. The processor of claim 16 further comprising: a sorting logic unit configured to re-order the entries in the subset.
 18. The processor of claim 17 further comprising: a validation logic unit configured to replace invalid entries in the subset with nulls, and to output the valid entries and nulls.
 19. A computer-readable storage medium configured to store a set of instructions used for manufacturing a semiconductor device, wherein the semiconductor device comprises: a decode unit configured to generate a plurality of bits; and an array having a plurality of rows and columns of entries, each entry in the subset residing on a different row of the array, wherein a previous column pointer is generated for each of the rows of the array based on the bits to indicate whether each entry is in a current column or a previous column.
 20. The computer-readable storage medium of claim 19 wherein the instructions are Verilog data instructions.
 21. The computer-readable storage medium of claim 19 wherein the instructions are hardware description language (HDL) instructions.
 22. A computer-readable storage medium configured to store data for reading a subset of entries of an array having a plurality of rows and columns of entries where each entry in the subset resides on a different row of the array, by generating a previous column pointer for each of the rows of the array based on a plurality of bits indicating the number of valid entries in the subset to be read, the previous column pointer indicating whether each entry is in a current column or a previous column. 